An intensity-expansion method to treat non-stationary time series: 
an application to the distance between prime numbers. 

We study the fractal properties of the distances between consecutive primes. The distance
sequence is found to be well described by a non-stationary exponential probability distribution. We
propose an intensity-expansion method to treat this non-stationarity and we find that the
statistics underlying the distance between consecutive primes is Gaussian and that, by transforming the
distance sequence into a stationary one, the range of Gaussian randomness of the sequence increases.

I. INTRODUCTION



Many complex systems have been empirically shown
to be characterized by scale invariance [1], and the correct
evaluation of the scaling exponents is of fundamental
importance to assess whether universality classes exist [2].
However, a correct interpretation of the results of the
statistical analysis often requires an examination of the
stationarity of the time series under study. Stationarity
means that the probability distribution for a given process
is invariant under a shift of the time origin. In
particular, if {X_i} is a time series collecting the measurements
of a physical observable, the probability density function
(pdf) p(X, t) is stationary if it remains invariant under
a shift of the origin of time.

However, time series are often not stationary. Their
pdf changes in time and caution should be taken in
analyzing non-stationary time series. Here, we suggest
a method to reduce the non-stationarity of a time series
through an intensity-expansion method that transforms
the original non-stationary time series into a new one
that satisfies the stationarity condition. The idea is to
study the evolution of the pdf of the time series {X_i}
and to look for a law that describes the non-stationarity
of the pdf p(X, t) of the time series. Finally, we transform
the original data into a new time series {Y_i} such
that Y = F(X, t). The new time series {Y_i} is described by
a pdf p(Y) such that

$$ p(Y)\, dY = p(X,t)\, dX . \tag{1} $$

Eq. (1) ensures that the pdf of the transformed time series {Y_i} no
longer depends on time and, therefore, that the series is now stationary.
So, the latter time series may be studied via conventional
stochastic methods.

As an example, we apply our method to a well-known
non-stationary time series in mathematics, the distance
(or waiting time) between two consecutive prime numbers.
Fig. 1 shows the first 1000 data points of a 110 million
point time series. The main source of non-stationarity
is the fact that the average distance between two
consecutive primes tends to increase as the numbers
increase. In fact, Gauss, in an 1849 letter to the astronomer
Encke, stated that he had found a function that
approximately gives the number of primes up to the number N,
that is, the well-known log-integral function



$$ \mathrm{Li}(N) = \int_2^N \frac{dt}{\log(t)} . \tag{2} $$



Eq. (2) shows that the distance between consecutive
primes is non-stationary because its average Λ(N_0, N)
in the range [N_0, N_0 + N] depends on the origin N_0
according to the approximate relation

$$ \Lambda(N_0, N) \approx \frac{N}{\mathrm{Li}(N_0 + N) - \mathrm{Li}(N_0)} . \tag{3} $$

The non-stationarity is due to the fact that the log-integral
function, Li(N), is not linear in N.
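
As a rough numerical illustration of the relation N / [Li(N_0 + N) - Li(N_0)] for the mean distance, the following sketch integrates 1/log(t) over [N_0, N_0 + N] and prints the resulting average for a few origins. The function names, the use of Simpson's rule and the number of integration steps are assumptions of this illustration and are not part of the original analysis.

```python
import math


def li_difference(n0, n, steps=10_000):
    """Approximate Li(n0 + n) - Li(n0), the integral of dt/log(t) over
    [n0, n0 + n], with the composite Simpson rule (requires n0 > 1)."""
    if steps % 2:
        steps += 1                       # Simpson's rule needs an even count
    h = n / steps
    total = 1.0 / math.log(n0) + 1.0 / math.log(n0 + n)
    for j in range(1, steps):
        weight = 4 if j % 2 else 2
        total += weight / math.log(n0 + j * h)
    return total * h / 3.0


def mean_gap(n0, n):
    """Approximate mean distance between consecutive primes in [n0, n0 + n]."""
    return n / li_difference(n0, n)


if __name__ == "__main__":
    # The average grows slowly with the origin, which is the non-stationarity.
    for origin in (10**3, 10**6, 10**9):
        print(origin, round(mean_gap(origin, 10**5), 2))
```

The printed averages increase with the origin N_0, which is the slow drift that the rest of the analysis has to account for.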

This problem can be stated in many different ways and
its most compact form can be found in the Riemann
hypothesis. This problem, in the opinion of many
mathematicians, is probably today's most important
unsolved problem in pure mathematics. Many references
to the Riemann hypothesis and its early history can be
found in Landau and Edwards.

However, Eq. (3) gives only approximate information
about the mean value and does not say anything
about the properties of the fluctuations of the distance
between consecutive primes. Therefore, several questions
arise, including the particular form of the pdf of the
distance between primes, how the pdf evolves in time,
and whether the fluctuations contain some form of
long-range correlation. Finally, we investigate the role played
by the non-stationary properties of the time series and
how to handle these properties.

In Sec. II we review two complementary statistical
methods used to detect scaling properties in a time series.
In Sec. III we analyze the sequence of distances between
primes and in Sec. IV we draw some conclusions.



II. SCALING ANALYSIS 

In the analysis of the distance between prime numbers
we make use of pdf analysis and of two complementary
scaling analysis methods: the standard deviation
analysis (SDA) and the diffusion entropy analysis (DEA)
[10, 11, 12, 13, 14]. These two methods are used together
to analyze the scaling properties of a time series in order to
discriminate the stochastic nature of the data: Gaussian
or Lévy.

We analyze the scaling exponents of the diffusion process
generated by a time series. The SDA allows one to
determine the scaling exponent of the standard deviation
of the diffusion pdf p(x, t) with time, which is usually called
the Hurst exponent H, whereas the DEA allows one to
determine the scaling exponent δ of the same pdf of the
diffusion process under study. Finally, we compare H and δ.
In the particular case in which the data are characterized
by Gaussian statistics, it is possible to prove that H = δ.
If H ≠ δ the scaling presents anomalous behavior.
Random or uncorrelated Gaussian noise is characterized by
H = δ = 0.5. If the noise is long-range correlated we have
0 < H < 0.5 for antipersistent noise and 0.5 < H < 1 for
persistent noise [9]. We stress that the fractal correlation
properties detected by these techniques are accurate only
if the time series used to generate the diffusion process
is stationary. In fact, non-stationarity may be mistaken
for correlations because it yields a memory that may
generate an anomalous behavior in the scaling exponents.
So, some caution should be taken in treating
non-stationary signals.

Following the DEA prescription, we interpret
the numbers of a time series as generating diffusion
fluctuations and we shift our attention from the time series
to the pdf p(x, t), where x denotes the variable collecting
the fluctuations. The scaling property of the pdf takes
the form

$$ p(x,t) = \frac{1}{t^{\delta}}\, F\!\left( \frac{x}{t^{\delta}} \right) , \tag{4} $$



where δ is the scaling exponent. In practice, let us
consider a sequence of N numbers

$$ \xi_i , \quad i = 1, \ldots, N . \tag{5} $$

The goal is to establish the possible existence of scaling,
either normal or anomalous, in this sequence. First of all,
let us select an integer number t, fitting the condition
1 ≤ t ≤ N. This integer number will be referred to as
the "diffusion time". For any given time t we can find
M(t) = N - t + 1 sub-sequences defined by

$$ \xi_i^{(s)} = \xi_{i+s} , \quad \text{with } s = 0, \ldots, N - t . \tag{6} $$

For each of these sub-sequences we build up a diffusion
trajectory, labeled by the index s, defined by the position

$$ x^{(s)}(t) = \sum_{i=1}^{t} \xi_i^{(s)} = \sum_{i=1}^{t} \xi_{i+s} . \tag{7} $$



The direct evaluation of the variance is probably the
most natural method of variance scaling detection. All
trajectories start from the origin x(t = 0) = 0. With
increasing time t, the sub-sequences generate a diffusion
process. At each time t, it is possible to calculate
the standard deviation of the position of the M(t)
sub-sequences with the well-known expression for the
standard deviation:

$$ D(t) = \sqrt{ \frac{ \sum_{s=0}^{N-t} \left[ x^{(s)}(t) - \overline{x}(t) \right]^2 }{ M(t) - 1 } } , \tag{8} $$

where \overline{x}(t) is the average of the positions of the M(t)
sub-trajectories at time t. The exponent H is defined for
a scaling diffusion process by

$$ D(t) \propto t^{H} . \tag{9} $$
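
The SDA steps described above (overlapping sub-sequences, diffusion positions, standard deviation at each diffusion time, log-log slope) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of prefix sums, the choice of diffusion times and the synthetic Gaussian test signal are all assumptions.

```python
import math
import random


def diffusion_std(xi, times):
    """Standard deviation D(t) of the diffusion positions at each time t.

    All N - t + 1 overlapping windows of length t are used.
    """
    n = len(xi)
    # Prefix sums: the position of trajectory s at time t is c[s + t] - c[s].
    c = [0.0]
    for value in xi:
        c.append(c[-1] + value)
    d = {}
    for t in times:
        m = n - t + 1                      # number of sub-sequences M(t)
        pos = [c[s + t] - c[s] for s in range(m)]
        mean = sum(pos) / m
        d[t] = math.sqrt(sum((p - mean) ** 2 for p in pos) / (m - 1))
    return d


def log_log_slope(curve):
    """Least-squares slope of ln D(t) against ln t, an estimate of H."""
    xs = [math.log(t) for t in curve]
    ys = [math.log(v) for v in curve.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    rng = random.Random(0)                 # synthetic signal, not prime data
    noise = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
    d = diffusion_std(noise, [1, 2, 4, 8, 16, 32, 64])
    print("estimated H:", round(log_log_slope(d), 3))
```

For uncorrelated Gaussian noise the estimated slope should lie close to 0.5; fitting over different ranges of t is what reveals a change of regime.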



The DEA is based upon the following algorithm. We
partition the x-axis into cells of size ε(t) and label the
cells by i = 1, 2, .... We count how many particles are
found in the same cell at a given time t. We denote this
number by N_i(t). Then we use this number to determine
the probability that a particle can be found in the i-th
cell at time t, p_i(t), by means of

$$ p_i(t) = \frac{N_i(t)}{M(t)} . \tag{10} $$

At this stage the entropy of the diffusion process at time
t can be determined and reads

$$ S(t) = - \sum_i p_i(t) \ln [ p_i(t) ] . \tag{11} $$



The easiest way to proceed with the choice of the cell size,
ε(t), is to assume it to be a fraction of the square root
of the variance of the fluctuations and, consequently,
independent of t. If the scaling condition (4) holds true,
it is easy to prove that

$$ S(t) = A + \delta \ln(t) , \tag{12} $$

where, in the continuous approximation,

$$ A \equiv - \int_{-\infty}^{\infty} dy \, F(y) \ln [ F(y) ] , \tag{13} $$

with y = x/t^δ. The scaling relations in Eqs. (9) and (12)
determine the exponents H and δ, respectively.
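
The DEA algorithm described above can be sketched in the same spirit: diffusion positions are binned into cells of fixed size, the Shannon entropy of the cell occupation is computed, and δ is read off as the slope of S(t) against ln(t). The function names, the cell size of half a standard deviation and the synthetic Gaussian signal are assumptions of this sketch.

```python
import math
import random
from collections import Counter


def diffusion_entropy(xi, times, cell_fraction=0.5):
    """Entropy S(t) of the diffusion positions at each diffusion time t."""
    n = len(xi)
    mean = sum(xi) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in xi) / (n - 1))
    if std == 0.0:
        raise ValueError("constant series: no diffusion to analyze")
    eps = cell_fraction * std              # cell size, independent of t
    c = [0.0]                              # prefix sums of the series
    for value in xi:
        c.append(c[-1] + value)
    entropy = {}
    for t in times:
        m = n - t + 1                      # number of sub-sequences M(t)
        cells = Counter(math.floor((c[s + t] - c[s]) / eps) for s in range(m))
        entropy[t] = -sum((k / m) * math.log(k / m) for k in cells.values())
    return entropy


def entropy_slope(entropy):
    """Least-squares slope of S(t) against ln t, an estimate of delta."""
    xs = [math.log(t) for t in entropy]
    ys = list(entropy.values())
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    rng = random.Random(0)                 # synthetic signal, not prime data
    noise = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
    s = diffusion_entropy(noise, [2, 4, 8, 16, 32])
    print("estimated delta:", round(entropy_slope(s), 3))
```

Comparing this slope with the SDA exponent on the same series is the check that distinguishes Gaussian scaling (H = δ) from anomalous scaling.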

A comparison between DEA and some other traditional
techniques of scaling detection, among them the
detrended fluctuation analysis [14] and a wavelet variance
based method, for both Lévy and Gauss statistics,
is reported in the DEA literature.

III. PRIME NUMBER ANALYSIS

The time series under study here is the distance X_i
between two consecutive prime numbers. For example,
for the first 6 prime numbers (p_i = 2, 3, 5, 7, 11, 13, ...)
the distances are X_i = p_{i+1} - p_i = 1, 2, 2, 4, 2, ... with
the index i = 1, 2, .... Fig. 1 shows the distances X_i of
the first 1000 prime numbers. The entire time series that
we analyze contains almost 110 million distances that
correspond to all prime numbers between 1 and 2^32, that
is, the largest integer handled by a 32-bit computer.
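
For readers who wish to reproduce a small portion of the distance series, the following sketch builds it with a sieve of Eratosthenes. It is only an illustration: the function names and the small limit are assumptions, and a plain in-memory sieve like this one is not suited to the full range analyzed here.

```python
import math


def primes_up_to(limit):
    """Sieve of Eratosthenes returning all primes <= limit."""
    if limit < 2:
        return []
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"               # 0 and 1 are not prime
    for p in range(2, math.isqrt(limit) + 1):
        if sieve[p]:
            # Clear every multiple of p starting from p * p.
            sieve[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i, flag in enumerate(sieve) if flag]


def prime_gaps(limit):
    """Distances X_i = p_{i+1} - p_i for the primes up to limit."""
    primes = primes_up_to(limit)
    return [q - p for p, q in zip(primes, primes[1:])]


if __name__ == "__main__":
    print(prime_gaps(13))                  # distances among 2, 3, 5, 7, 11, 13
```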

First, we analyze the stationarity of our time series. To
do this we partition the entire time series into 110
consecutive subsets of 1 million data points each and study
the distribution of the data for each subset. Fig. 2 shows
the distribution of the distances of the first and last
subsets. The figure shows that the distributions of the data
are well fitted by an exponential distribution of the kind

$$ p(X, n) = A(n) \exp [ -k(n) X ] , \tag{14} $$

where the integer n is the index of the subset and
τ(n) = 1/k(n) is the characteristic distance between two
primes. The dependence of the characteristic time τ(n)
on the subset index n indicates that the time series of
distances X_i between two consecutive prime numbers is
not stationary and some caution should be taken in the
analysis of this time series. We observe that Kumar et
al. recently noticed that this distribution follows an
exponential form [15], but they did not notice that this
exponential form is not stationary.
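
The subset-by-subset exponential fit can be sketched as below. The fitting procedure is not specified in the text, so everything here is an assumption: a histogram with bins of width 2, an unweighted least-squares line through the logarithm of the binned density, a minimum bin population, and the function names. The demonstration uses synthetic exponential data rather than prime distances.

```python
import math
import random
from collections import Counter


def fit_exponential(values, bin_width=2.0, min_count=20):
    """Fit p(X) = A exp(-k X) to a histogram of `values`; returns (A, k).

    A straight line is fitted to ln(density) at the bin centers, using
    only bins that hold at least `min_count` points.
    """
    n = len(values)
    counts = Counter(int(v // bin_width) for v in values)
    xs, ys = [], []
    for b, c in counts.items():
        if c >= min_count:
            xs.append((b + 0.5) * bin_width)
            ys.append(math.log(c / (n * bin_width)))
    if len(xs) < 2:
        raise ValueError("not enough populated bins for a fit")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope


def fit_subsets(series, subset_size):
    """Apply the fit to consecutive, non-overlapping subsets of the series."""
    last_start = len(series) - subset_size
    return [fit_exponential(series[i:i + subset_size])
            for i in range(0, last_start + 1, subset_size)]


if __name__ == "__main__":
    rng = random.Random(0)                 # synthetic data with k = 0.05
    data = [rng.expovariate(0.05) for _ in range(100_000)]
    for a, k in fit_subsets(data, 50_000):
        print(round(a, 4), round(k, 4))
```

A drift of the fitted k across subsets, rather than scatter around a constant, is the signature of non-stationarity discussed in the text.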

The non-stationarity of the distances between primes
is depicted in Fig. 3. We plot the values of the
exponential fitting parameter k(n) for all 110 subsets. We
see that the values k(n) decrease and fluctuate around a
logarithmic curve of the type

$$ k(n) = a + b \log(n) + O(\log(n)^2) , \tag{15} $$

where the best least-squares fitting parameters are a =
0.074 ± 0.002 and b = -0.0041 ± 0.0002. The decrease
of k(n) implies that the average distance between two
primes, which is related to τ(n), tends to grow with the
size of the numbers themselves. Fig. 4 shows the values
of the fitting coefficients A(n) of Eq. (14) and Fig. 5
shows the mean value of the distance between consecutive
primes for each one million data subset compared with
the approximate mean value given by

$$ \Lambda(n) \approx \int_0^{\infty} X \, p(X, n) \, dX = \frac{A(n)}{k(n)^2} , \tag{16} $$

which is obtained by using the pdf (14).
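
As a short check of the mean implied by the exponential form (14), the integral can be carried out explicitly. The sketch below treats X as a continuous variable on [0, ∞), which is an assumption of this illustration since the actual distances are integers. The normalization condition in the last line is likewise an assumption, shown only to connect the result with τ(n) = 1/k(n).

```latex
\begin{align}
\int_0^{\infty} X\, A(n)\, e^{-k(n) X}\, dX
  &= A(n) \left[ -\frac{X}{k(n)}\, e^{-k(n) X} \right]_0^{\infty}
   + \frac{A(n)}{k(n)} \int_0^{\infty} e^{-k(n) X}\, dX \\
  &= \frac{A(n)}{k(n)^2} . \\
\text{If } \int_0^{\infty} A(n)\, e^{-k(n) X}\, dX &= \frac{A(n)}{k(n)} = 1 ,
  \text{ then } A(n) = k(n) \text{ and the mean reduces to } \frac{1}{k(n)} = \tau(n) .
\end{align}
```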

Eq. (15) expresses only the linear approximation to the
non-stationary behavior of the distances between primes
because k(n) cannot become negative since the
characteristic time, τ(n), must be positive. Therefore k(n) can,
at most, only asymptotically converge to zero. However,
Fig. 3 gives some indication of the goodness of the
linear fit of Eq. (15) in the range of numbers that we
consider here.

Eqs. (14) and (15) also suggest how to reduce the
non-stationarity of the distance series between primes. In
fact, we can use the time series of X_i and
Eq. (15) to define an intensity-expanded series Y_i, with
i = 1, 2, ..., in the following way:

$$ Y_i = k(10^{-6} i) \, X_i . \tag{17} $$

Note that we substitute the integer n with 10^{-6} i
because in our notation n represents the index of the n-th
subset of 1 million distances between primes. In fact, it
is easy to realize that according to Eqs. (14) and
(17) the new time series Y_i is more stationary than the
previous series of distances X_i because the values k(n)
for the time-expanded time series will fluctuate around
the constant value k(n) ≈ 1. So, each subset of 1 million
data of the time-expanded series Y_i is distributed according to
the same probability density function p(Y) ∝ exp[-Y].
This property ensures an approximate stationarity of the
intensity-expanded series Y_i and allows the adoption of
analyses that require the stationarity of the dataset. So,
we conclude that in the particular case of the distance
between primes the transformation (17) satisfies the
condition expressed by Eq. (1).
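
The intensity expansion itself amounts to one multiplication per data point. The sketch below applies it with the fitted values a = 0.074 and b = -0.0041 quoted above. Reading log as the natural logarithm is an assumption of this sketch, as are the function names and the `subset_size` parameter, which generalizes the factor 10^{-6}.

```python
import math

A_FIT = 0.074      # a in the logarithmic fit of k(n), as quoted in the text
B_FIT = -0.0041    # b in the logarithmic fit of k(n), as quoted in the text


def k_of_n(n, a=A_FIT, b=B_FIT):
    """Linear-in-log approximation of k(n); natural log is an assumption."""
    return a + b * math.log(n)


def intensity_expand(x, subset_size=1_000_000):
    """Return Y_i = k(i / subset_size) * X_i for i = 1, 2, ..."""
    return [k_of_n(i / subset_size) * xi for i, xi in enumerate(x, start=1)]


if __name__ == "__main__":
    # Tiny demonstration on the first five distances between primes.
    print(intensity_expand([1, 2, 2, 4, 2]))
```

Because k decreases slowly with the index while the typical distance grows, the product stays on a roughly constant scale, which is the point of the transformation.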

Figs. 6 and 7 show, respectively, the DEA and SDA
applied to both datasets, the original distances X_i and
the time-expanded series Y_i. Both figures show that by
increasing the diffusion time t both scaling exponents δ
and H change from the value δ = H = 0.5 at short range
to δ = H = 0.88 at long range. The fact that the two
scaling exponents are the same indicates that the data
are consistent with Gaussian statistics.

Figs. 6 and 7 allow us to conclude that the statistics
of the distance between primes can be considered
well described by random noise plus a temporal
amplification due to the non-stationary mechanism indicated
in Figs. 2-5 and described by Eqs. (15) and (16). In
fact, both DEA and SDA of the distances between primes
(shown in the figures by the curves with triangles) have
the scaling exponents δ = H = 0.5 until t ≈ 50 and
have δ = H = 0.88 for t > 50. With these results it
is possible to mistake the distances between primes for
a random signal that is practically uncorrelated at short
range and strongly correlated at long range. However,
the curves concerning the DEA and SDA applied to the
intensity-expanded series Y_i given by (17) suggest a
different interpretation. In fact, with a simple
transformation aimed at reducing the non-stationary
properties of the original time series, it is possible to
extend the uncorrelated range from a short range,
t < 50, to t < 10000, as Figs. 6 and 7 clearly show
(curves with circles). Thus, the results shown in Figs. 6
and 7 suggest that the high value of the scaling
exponents, δ = H = 0.88, at long range, at least for
50 < t < 10000, is not a manifestation of long-range
fractal noise but of the non-stationarity of the original
time series. Perhaps, with better fitting procedures and
a higher-order fitting function in place of (15), it will be possible
to extend the range of uncorrelated noise further.

IV. CONCLUSION 

In this paper we have introduced a method to transform
a non-stationary time series into a new time series
that better satisfies the stationarity condition. We study
the evolution of the probabilistic structure of the original
time series {X_i}, that is, we study the non-stationary pdf
p(X, t) of the original time series and, finally, we transform
the original data into a new time series {Y_i} through
an intensity-expansion mechanism such that Y = F(X, t)
and such that the pdf p(Y) is independent of t.

The DEA and SDA techniques are used to study the
statistics of the distances between prime numbers. From
this study it is seen that the distance between primes
generates a non-stationary time series due to an
increasing mean distance between primes. This non-stationarity
is responsible for the apparently persistent fractal nature
of the time series at large time range. However, upon
removal of the non-stationary components of this time
series, the neighboring primes show a great deal of
randomness over a much longer range. This may imply that a
deeper understanding of the distribution of primes along
the real axis may be gained by conducting a similar study
with much larger datasets. This technique does, however,
suggest that the distribution of primes is consistent with
Gaussian statistics.

FIG. 1: The distance X_i between the first 1000 consecutive prime numbers.




FIG. 2: Distribution of the distance X_i for the first subset of 1 million data and for the 110-th subset. The distributions are
fitted by the exponential distribution (14).




FIG. 3: The exponential fitting parameters k(n) for all 110 subsets of the distance X_i.

FIG. 5: The mean value of the distance between consecutive primes for each one million data subset (circles) compared with
the approximate mean value given by Eq. (16) (triangles). The error is almost 10%.

FIG. 6: DEA of the original distance dataset X_i (triangles) and of the time-expanded dataset Y_i (circles).




FIG. 7: SDA of the original distance dataset X_i (triangles) and of the time-expanded dataset Y_i (circles).



